Semi-Supervised Kernel PCA 



Christian Walder, Ricardo Henao, Morten Mørup and Lars Kai Hansen

Informatics and Mathematical Modelling 
Technical University of Denmark, DK-2800 
{chwa,rh,mm,lkh}@imm.dtu.dk



Abstract. We present three generalisations of Kernel Principal Components
Analysis (KPCA) which incorporate knowledge of the class labels of a subset
of the data points. The first, MV-KPCA, penalises within-class variances in
a manner similar to Fisher discriminant analysis. The second, LS-KPCA, is a
hybrid of least squares regression and kernel PCA. The third, LR-KPCA, is an
iteratively reweighted version of LS-KPCA which achieves a sigmoid loss
function on the labeled points. We provide a theoretical risk bound as well
as illustrative experiments on real and toy data sets.



1 Introduction 

In Semi-Supervised Learning (SSL) we are given a set of data points, only some 
of which come with class labels, and wish to infer a function which classifies 
new points. Alternatively we may not require the function but only its value 
on the unlabeled points, as in transduction. A considerable amount of work has 
recently been done here, see e.g. [1,2] for an overview and [3] for a discussion of
the problem. Our approach is most closely related to the class of discriminative
algorithms exemplified by the transductive support vector machine or T-SVM
[4]. The classifying function of this natural semi-supervised extension of the
(normal, or fully supervised) SVM can be written 

$$f^* = \arg\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}}^2 + c_1 \sum_{i \in \mathcal{C}} L(f(x_i), t_i) + c_2 \sum_{j=1}^{m} U(f(x_j)),$$

where $x_i \in \mathcal{X}$ ($t_i \in \{\pm 1\}$) are the data points (labels), $\mathcal{C}$ the indices of the labeled
points, and $\mathcal{H}$ a Reproducing Kernel Hilbert Space (RKHS). The labeled loss
function proposed for the T-SVM is the usual hinge loss $L(f(x), t) = (1 - t f(x))_+$
of the normal SVM, while the unlabeled loss $U(f(x)) = (1 - |f(x)|)_+$ is the
natural unlabeled analog of $L$, which we depict in Figure 1. Although it appears
to be as sensible as the SVM, the non-convexity of $U$ makes the T-SVM much
more difficult to handle, leading to various optimisation strategies [5].
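
As a small illustration of the two T-SVM losses just described, the following sketch evaluates the hinge loss and its unlabeled analog with NumPy. The function names `hinge_loss` and `tsvm_unlabeled_loss` are ours and purely illustrative. The unlabeled loss equals the smaller of the two hinge losses obtained by trying both labels, which is one way to see why it is non-convex.

```python
import numpy as np

def hinge_loss(f, t):
    """Labeled hinge loss L(f(x), t) = (1 - t f(x))_+ ."""
    return np.maximum(0.0, 1.0 - t * f)

def tsvm_unlabeled_loss(f):
    """Unlabeled T-SVM loss U(f(x)) = (1 - |f(x)|)_+ ."""
    return np.maximum(0.0, 1.0 - np.abs(f))

f_values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# U(f) coincides with the hinge loss under the more favourable of the two labels.
best_label_hinge = np.minimum(hinge_loss(f_values, 1.0), hinge_loss(f_values, -1.0))
print(tsvm_unlabeled_loss(f_values), best_label_hinge)
```

The loss is largest at f(x) = 0 and vanishes once |f(x)| is at least 1, so minimising it pushes function values away from zero in either direction.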

In this paper we propose SSL algorithms which can be thought of either as
generalisations of the (normally fully unsupervised) KPCA or as relaxations of
the T-SVM in which $U$ takes the simpler form $U(f(x)) = -f(x)^2$. Although
Fig. 1. The transductive SVM loss function for unlabeled points. Penalisation of this
non-convex loss favours values f(x) which are either sufficiently positive or sufficiently 
negative, but not too close to zero. 

this term is also non-convex (it is concave), it does lead to computational
advantages. In particular, choosing also a quadratic loss for $L$ instead of the hinge,
the problem is exactly solvable, as we show in Section 3.3. This combination of
quadratic losses with exact solution is our least squares or LS-KPCA. Building
on the useful exact solvability of this (still non-convex) proxy for the T-SVM, we
then propose logistic regression or LR-KPCA, an iteratively reweighted
version of LS-KPCA which gets closer to the T-SVM by implementing a sigmoidal
loss function $L$ and utilising the exact solution of LS-KPCA in an inner loop.

1.1 Overview and Organisation of the Paper 

We review KPCA in Section 2 from a slightly unusual functional perspective.
Our derivation relies on the representer theorem [6], which turns out to make the
discussion of our SSL generalisations of KPCA rather clean and straightforward.
These generalisations of KPCA make up Section 3. In Section 3.1 we introduce
MV-KPCA, which differs from KPCA in that the variance should be small over
some prescribed subsets of the data. This is the simplest method we propose in
that it is solved by a normal (generalised) eigenvalue problem. We argue in
Section 3.2 that this formulation may be problematic. Addressing these problems,
in Section 3.3 we introduce LS-KPCA, the method mentioned above with purely
quadratic $L$ and $U$, which enjoys the risk bound we present in Section 5.
LS-KPCA represents a greater departure from KPCA than MV-KPCA, but can also
be solved exactly due to [7]. In Section 3.4 we move further from KPCA, with an
iterative reweighting scheme which utilises this exact solution in an inner loop
in order to achieve a sigmoid rather than quadratic loss function, the intuition
being that this may be more appropriate for classification problems. A simple
yet numerically stable optimisation procedure for LS- and LR-KPCA is outlined
in Section 4. In Section 6 we compare our algorithms to previous approaches,
focussing on the Spectral Graph Transducer (SGT) of [8]. We present results on
standard benchmark data sets in Section 7 and finish with some conclusions in
Section 8.



2 Kernel PCA 



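We treat KPCA [9] slightly differently than usual, as the problem of finding

$$f^* = \arg\max_{f \in \mathcal{H}} \sum_{i=1}^{m} \Big( f(x_i) - \frac{1}{m} \sum_{j=1}^{m} f(x_j) \Big)^2 \qquad (1)$$

$$\text{subject to } \|f\|_{\mathcal{H}}^2 = 1, \qquad (2)$$

where $\mathcal{H}$ is the RKHS with kernel $k(\cdot, \cdot)$. The Lagrangian function associated
with this problem is

$$L(f, \lambda) = \sum_{i=1}^{m} \Big( f(x_i) - \frac{1}{m} \sum_{j=1}^{m} f(x_j) \Big)^2 + \lambda \big( \|f\|_{\mathcal{H}}^2 - 1 \big).$$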



For the Lagrangian dual we maximise $L(f, \lambda)$ over $f \in \mathcal{H}$. The representer
theorem [6] then implies $f^*(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x)$ for some $\alpha_1, \alpha_2, \ldots, \alpha_m \in \mathbb{R}$.
Combined with the reproducing property $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$, we obtain the
simplification of (1) and (2) to

$$\alpha^* = \arg\max_{\alpha} \alpha^\top (K^\top K - K^\top E_m K) \alpha \qquad (3)$$

$$\text{subject to } \alpha^\top K \alpha = 1, \qquad (4)$$

where $[K]_{ij} = k(x_i, x_j)$ and $E_m$ is a square matrix of size $m$ with entries $1/m$, as
it shall be throughout the paper. Imposing stationarity in $\alpha$ on the Lagrangian
of (3) and (4) we find that $\alpha^*$ is the eigenvector $\alpha$ with largest eigenvalue $\lambda$ of
the generalised eigenvalue problem

$$K \alpha = \lambda (K^\top K - K^\top E_m K) \alpha. \qquad (5)$$

It is easy to verify that this formulation of KPCA is equivalent up to a
renormalisation to the original one [9] in its centered form. The simpler uncentered
version assumes that the data is centered in feature space. This leads to a slightly
different eigenproblem formed by replacing the term $(K^\top K - K^\top E_m K)$ with
$K^\top K$ in (5). Here we can recover the uncentered version of KPCA by replacing
the objective in (1) with $\sum_{i=1}^{m} f(x_i)^2$, that is, the variance of the function values
$f(x_i)$ assuming they have zero mean.
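
The constrained maximisation (3)-(4) can be sketched numerically as follows. This is a minimal illustration under our own assumptions rather than the implementation used for the experiments: the helper names, the Gaussian kernel and the small ridge `jitter` added to K (so that a Cholesky factorisation exists) are not part of the text. The sketch whitens the constraint with the Cholesky factor of K and then takes the top eigenvector of the resulting symmetric matrix, which maximises the ratio of the objective (3) to the constraint function (4).

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """Kernel matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def kpca_first_direction(K, jitter=1e-6):
    """Maximise alpha^T P alpha subject to alpha^T K alpha = 1, cf. (3)-(4)."""
    m = K.shape[0]
    E = np.full((m, m), 1.0 / m)                 # E_m, all entries 1/m
    P = K.T @ K - K.T @ E @ K                    # centred variance matrix
    Kj = K + jitter * np.eye(m)                  # assumption: ridge for stability
    Lc = np.linalg.cholesky(Kj)                  # Kj = Lc Lc^T
    # Substituting alpha = Lc^{-T} v turns the constraint into v^T v = 1.
    M = np.linalg.solve(Lc, np.linalg.solve(Lc, P).T).T   # Lc^{-1} P Lc^{-T}
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    alpha = np.linalg.solve(Lc.T, V[:, -1])      # map top eigenvector back
    return alpha, w[-1]                          # w[-1] = attained objective

def evaluate(alpha, X_train, X_new, gamma):
    """f(x) = sum_i alpha_i k(x_i, x) at new points (out-of-sample values)."""
    sq_tr = np.sum(X_train ** 2, axis=1)
    sq_new = np.sum(X_new ** 2, axis=1)
    d2 = np.maximum(sq_new[:, None] + sq_tr[None, :] - 2.0 * X_new @ X_train.T, 0.0)
    return np.exp(-gamma * d2) @ alpha
```

Replacing `P` by `K.T @ K` in this sketch would give the uncentered variant described above.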



3 Semi-Supervised Kernel PCA

We propose three means of incorporating label information into KPCA, with
MV-KPCA (Section 3.1) incorporating a slightly different type of label
information than LS- and LR-KPCA (Sections 3.3 and 3.4).



3.1 Minimum Variance Kernel PCA (MV-KPCA) 



We begin with MV-KPCA, which incorporates knowledge of pairwise, or rather
group-wise similarity. This is the simplest method in that it is solved by an
eigenproblem very similar to that of KPCA, and also in that it involves one extra
parameter rather than two. The idea of MV-KPCA is, reminiscent of the Kernel
Fisher Discriminant [10], to modify the constraint (2) by adding a loss term
based on the within-class variances. Given prescribed index sets $G_1, G_2, \ldots, G_g \subseteq
\{1, 2, \ldots, m\}$ of similar elements from the data set $x_1, x_2, \ldots, x_m$, the new
constraint is

$$\|f\|_{\mathcal{H}}^2 + c \sum_{j=1}^{g} \sum_{i \in G_j} \Big( f(x_i) - \frac{1}{|G_j|} \sum_{j' \in G_j} f(x_{j'}) \Big)^2 = 1,$$

where $c \in \mathbb{R}_+$ trades off between KPCA for $c = 0$ and increasing penalisation of
within-class variance for larger values of $c$. Note that (as in our experiments
in this paper) the $G_j$ may be derived from categorical class labels by assigning
points with the same label to the same group. Once again we can apply the
representer theorem to the Lagrangian of this problem. Since the augmented
objective function is purely quadratic (like that of the original KPCA), its optimal
solution is again found by an eigenproblem, this time

$$\Big( K + c \sum_{i=1}^{g} \big( K_i^\top K_i - K_i^\top E_{|G_i|} K_i \big) \Big) \alpha = \lambda (K^\top K - K^\top E_m K) \alpha, \qquad (6)$$

where $K_i$ is the sub-matrix of $K$ taking rows $G_i$. Note that it is not possible to
obtain a convex problem by replacing the equality constraint with an inequality:
although the resulting problem is equivalent and has a convex feasible region,
we would still be maximising a convex function over that region.
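
To make the within-class term of (6) concrete, the sketch below builds the matrix on the left hand side from a kernel matrix and a list of index groups. The function names are ours and the code is only an illustration. It uses the fact that $K_i^\top K_i - K_i^\top E_{|G_i|} K_i$ is the Gram matrix of the rows $K_i$ after subtracting their column means, so that the quadratic form in $\alpha$ equals the summed squared deviations of $f = K\alpha$ from its group means.

```python
import numpy as np

def within_group_matrix(K, groups):
    """Sum over groups of K_i^T K_i - K_i^T E_{|G_i|} K_i (K_i = rows G_i of K)."""
    m = K.shape[0]
    M = np.zeros((m, m))
    for G in groups:
        Ki = K[np.asarray(G), :]
        centred = Ki - Ki.mean(axis=0, keepdims=True)   # (I - E_{|G_i|}) K_i
        M += centred.T @ centred
    return M

def mv_kpca_constraint_matrix(K, groups, c):
    """Left hand side matrix of (6): K + c * (within-group variance term)."""
    return K + c * within_group_matrix(K, groups)
```

With `c = 0` the constraint matrix reduces to K, which recovers the plain KPCA constraint (4).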



3.2 Difficulties Parameterising MV-KPCA 

Since the objective function (1) and constraint (2) in KPCA (and MV-KPCA)
both scale the same way (quadratically), the maximisation of one with the other
fixed is equivalent to the maximisation of the ratio of the two. This is often
referred to as the Rayleigh quotient form of an eigenvalue problem [11]. A
consequence is that changing the constant on the right hand side of (2) only rescales
$f^*$, which could be problematic as we now argue. The objective we are
maximising is

$$\max_{f \in \mathcal{H}} \frac{\mathrm{VAR}[f]}{\|f\|_{\mathcal{H}}^2 + c \sum_i \mathrm{VAR}_i[f]}, \qquad (7)$$

where $\mathrm{VAR}[f]$ is the variance of the values of $f$ over all the $x_i$ and $\mathrm{VAR}_i[f]$ is the
variance of the values of $f$ over the points indexed by $G_i$. This looks like it may
be interesting for SSL. After all, the objective function favours large variance on
the unlabeled points while favouring small values of two fairly standard terms
for regularisation and risk, namely an RKHS norm and a type of quadratic
penalty. More importantly, although this may appear to be precisely the type of
semi-supervised learning objective which tends to be hard to optimise due to its
non-convexity, we can solve it in $O(m^3)$ time due to the convenient relationship
with the eigenvalue problem (6). The unfortunate part however, is that because
changing the constraint in (2) only multiplicatively scales the solution,
there is no obvious way to trade between the numerator and the denominator of
(7) in the same way we can trade off within the denominator via the parameter
$c$. This could be critical in SSL problems in which there are vastly different
numbers of labeled and unlabeled points. For a second multiplicative scaling
parameter in the ratio (7) to be non-trivial however, at least one of the terms
would need to scale non-quadratically.

3.3 Least Squares Kernel PCA (LS-KPCA) 

Continuing the previous argument, although many non purely quadratic
surrogates for any of the three terms in (7) are possible, few of the interesting ones
will lead to computational problems as straightforward as solving the eigenvalue
problem. Fortunately however, the classic squared loss does turn out to be
fairly convenient. Hence, as LS-KPCA we now propose the following
modification of KPCA. First, instead of maximising the first term (the objective (1)) with
the second term (the constraint (2)) fixed, we minimise the second with the first
fixed. Actually, this is still KPCA as these problems are of course equivalent.
Next, we add onto the new objective function a squared loss term, to get

$$f^* = \arg\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}}^2 + c \sum_{i \in \mathcal{C}} (f(x_i) - y_i)^2 \qquad (8)$$

$$\text{subject to } \mathrm{VAR}[f] = s^2. \qquad (9)$$

An example solution of the above problem is depicted in Figure 2. Unlike the
original KPCA constraint and MV-KPCA, the above constraint does break the
scale invariance of the ratio of the objective function and the constraint function
and the problem cannot be written as a ratio similar to (7). In other words, the
part of (8) which is linear in $f$ makes the relationship between $s$ and the
corresponding optimal $f^*$ non-trivial. Furthermore, although the parameterisation is
unusual, we are now able to control the relative importance of the three terms,
$\mathrm{VAR}[f]$, $\|f\|_{\mathcal{H}}^2$ and the squared error part of (8), via the parameters $c$ and $s^2$.
Applying the representer theorem as before yields

$$\alpha^* = \arg\min_{\alpha} \alpha^\top K \alpha + c \|K_{\mathcal{C}} \alpha - t\|^2 \qquad (10)$$

$$\text{subject to } \alpha^\top (K^\top K - K^\top E_m K) \alpha = s^2, \qquad (11)$$



where $t \in \mathbb{R}^{|\mathcal{C}|}$ is the sub-vector of $y$ taking indices $\mathcal{C}$, and $K_{\mathcal{C}}$ is the submatrix
of $K$ taking rows $\mathcal{C}$. To solve (10) and (11) we can make use of the ideas in [7],




[Figure 2 panels: (a) KPCA, first eigenfunction; (b) KPCA, second eigenfunction; (c) KPCA, third eigenfunction; (d) LS-KPCA, small c; (e) LS-KPCA, medium c; (f) LS-KPCA, large c]

Fig. 2. An example in $\mathbb{R}^2$ of unlabeled (white) and labeled (red/blue, four per class)
points using a spherical Gaussian kernel on the two moons toy dataset. The value of
$f^*$ (from equation (1) for (a)-(c) and (8) for (d)-(f)) is rendered in colour ranging
from around +3 (dark red) to -3 (dark blue). LS-KPCA approaches KPCA for small $c$,
hence there is a smooth transition (a)-(d)-(e)-(f), ignoring the sign change in (a) which
is arbitrary for KPCA.



which studies the problem 



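$$z^* = \arg\min_{z} z^\top G z - 2 b^\top z \qquad (12)$$

$$\text{subject to } z^\top z = s^2. \qquad (13)$$

It is shown that the solution to this non-convex problem is $z^* = (G - \lambda^* I)^{-1} b$,
where $\lambda^*$ is the smallest eigenvalue of the problem

$$\begin{pmatrix} G & -I \\ -\frac{1}{s^2} b b^\top & G \end{pmatrix} \begin{pmatrix} \gamma \\ \eta \end{pmatrix} = \lambda \begin{pmatrix} \gamma \\ \eta \end{pmatrix}.$$

Note that this result was used in a related context [8], as we discuss in Section
6. Making the change of variables $z = P^{1/2} \alpha$ where

$$P = K^\top K - K^\top E_m K,$$

we can use this result to derive that

$$\alpha^* = (C - \zeta^* P)^{-1} b, \qquad (14)$$

where $C = K + c K_{\mathcal{C}}^\top K_{\mathcal{C}}$, $b = c K_{\mathcal{C}}^\top t$ and $\zeta^*$ is the smallest eigenvalue of the
generalised eigenvalue problem

$$\begin{pmatrix} C & -P \\ -\frac{1}{s^2} b b^\top & C \end{pmatrix} \begin{pmatrix} \gamma \\ \eta \end{pmatrix} = \zeta \begin{pmatrix} P & 0 \\ 0 & P \end{pmatrix} \begin{pmatrix} \gamma \\ \eta \end{pmatrix}. \qquad (15)$$

The change of variables is unnecessary however, as we can repeat the arguments
in [7] with the constraint in (13) replaced by $z^\top P z = s^2$.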
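
The isotropically constrained problem (12)-(13) can be sketched as below. This is a hedged illustration rather than the procedure of [7]: instead of forming the doubled eigenvalue problem, it diagonalises G once and finds the multiplier by plain bisection on the scalar equation $\|(G - \lambda I)^{-1} b\|^2 = s^2$ to the left of the smallest eigenvalue of G. The function name and the bracketing choices are ours, and the sketch assumes the generic case in which b has a non-zero component along the eigenvector of the smallest eigenvalue.

```python
import numpy as np

def solve_sphere_constrained_qp(G, b, s, n_iter=200):
    """Minimise z^T G z - 2 b^T z subject to z^T z = s^2 (G symmetric)."""
    d, Q = np.linalg.eigh(G)                  # G = Q diag(d) Q^T, d ascending
    beta = Q.T @ b

    def norm2(lam):                           # ||(G - lam I)^{-1} b||^2
        return np.sum((beta / (d - lam)) ** 2)

    # norm2 is increasing on (-inf, d[0]); bracket its crossing of s^2.
    lo = d[0] - np.linalg.norm(b) / s         # here norm2(lo) <= s^2
    hi = d[0] - abs(beta[0]) / s              # here norm2(hi) >= s^2
    for _ in range(n_iter):                   # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        if norm2(mid) > s ** 2:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    z = Q @ (beta / (d - lam))                # z = (G - lam I)^{-1} b
    return z, lam
```

Because the multiplier found lies below the smallest eigenvalue of G, the matrix $G - \lambda I$ is positive definite, which is what makes the stationary point the global minimiser on the sphere despite the non-convex constraint.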



3.4 Logistic Loss via Reweighting (LR-KPCA)

Generalising LS-KPCA to arbitrary $L$ and $U$ for the labeled and unlabeled loss
functions, we get

$$f^* = \arg\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}}^2 + c \sum_{i \in \mathcal{C}} L(f(x_i), y_i) \qquad (16)$$

$$\text{subject to } \sum_{i} U(f(x_i)) = s^2. \qquad (17)$$

Note that the $U$ we intend here and for the remainder of the paper differs
from that of the T-SVM formulation in Section 1 by a sign change. The purely
quadratic losses of LS-KPCA may not be appropriate for classification. Leaving $L$
and $U$ unspecified but abusing the notation by extending them element-wise to
vectors, the representer theorem still applies, so we can write the problem in $\alpha$
as

$$\alpha^* = \arg\min_{\alpha} \alpha^\top K \alpha + c\, e^\top L(K\alpha, y) \qquad (18)$$

$$\text{subject to } e^\top U(K\alpha) = s^2, \qquad (19)$$



where $e$ is a vector of ones. We would like to use more sophisticated losses
$U$ and $L$ in the above formulation. For this, it is natural to try to leverage
the powerful result that we can solve the least squares formulation exactly, by
employing the iteratively reweighted least squares idea [12]. This is essentially
a Newton-Raphson method, but with the interpretation that each step solves a
least squares problem with modified weights.

A Newton-Raphson step solves a local second order approximation of the
problem. To be able to apply the exact solution of the previous section, we are
forced to choose a $U$ which is purely second order (i.e. with no linear term).
Then the local second order approximation of the constraint (19) is still purely
second order (and exact), and the form of the optimisation problem remains
that of LS-KPCA. Hence we maintain our initial choice $U(f(x)) = f(x)^2$. We
can try to improve on $L$ however, as doing so does not change the form of the
objective function on taking a local second order approximation, since the
LS-KPCA objective (10) already has a linear part. Hence, motivated by logistic
regression, as LR-KPCA we propose the sigmoid

$$L(f(x), y) = 1 / (1 + \exp(-y f(x))) \qquad (20)$$

as the loss term for labeled points. A Huber-like differentiable approximation of 
the support vector machine hinge loss could just as easily be used, however. 

By arguments similar to those of logistic regression [12], we can solve (18)-(19)
by iteratively reweighted LS-KPCA. We can derive as usual that the following
steps constitute a Newton-Raphson update. Given the current solution $\alpha_n$, we
compute $g, z, r, s \in \mathbb{R}^{|\mathcal{C}|}$ as $g = K_{\mathcal{C}} \alpha_n$, and

$$z_i = 1 / (1 + \exp(-t_i g_i)),$$
$$r_i = z_i (1 - z_i),$$
$$s_i = g_i - (z_i - t_i)(1 - z_i) / z_i,$$

for $i = 1, 2, \ldots, |\mathcal{C}|$. The next iterate $\alpha_{n+1}$ is defined like $\alpha^*$ of (14) and (15),
but with a different $C$ and $b$, which now depend on $r$ and $s$ according to

$$C = K + c K_{\mathcal{C}}^\top R K_{\mathcal{C}}, \qquad b = c K_{\mathcal{C}}^\top R s,$$

where $R$ is diagonal with $R_{ii} = r_i$.

Due to the form of the logistic function (20) we have that $r_i > 0$ and so
the resulting Hessian $C$ is always positive definite. As is also the case for
normal logistic regression however, we have no guarantee that it will improve the
objective function, making some form of back-tracking line search necessary.
Due to the constraint (17), this is not as simple as moving back on the line
$\lambda \alpha_n + (1 - \lambda) \alpha_{n-1}$, $0 < \lambda < 1$. Instead, in order to guarantee convergence we
check the objective function, and as long as it is not better than the
previous iterate, we solve a modified problem with an additional regularisation term
$\lambda \|\alpha_n - \alpha_{n-1}\|^2$, where $\lambda$ is a parameter we increase until we see an
improvement in the (unmodified) objective function. It is important to note that similar
line search heuristics are also required in the iteratively reweighted maximum
likelihood solver of the standard logistic regression model.



(20) 



[Figure 3 panels: (a) LS-KPCA; (b) LR-KPCA]

Fig. 3. LS-KPCA (left) suffers here due to the value of $s^2$ in the constraint being large. This
essentially enforces a large squared value of the function at certain points, conflicting
with the squared loss which favours values near ±1, so that the energy of the function
gets concentrated away from the labeled points. The sigmoid loss of LR-KPCA (right)
can discount labeled points which are well classified, leaving the energy unhampered.
The values range from around -50 (dark red) to +10 (dark blue) for LS-KPCA, and
from around -3 to +3 for LR-KPCA.



4 Efficient Solution 



We need to compute the $\zeta^*$ of (14), both for LS-KPCA and the inner loop of
LR-KPCA. As argued in [7], doing so via the eigenvalue equation (15) directly
can be highly unstable since the matrix on the left hand side is not symmetric,
is twice the size of the normal KPCA eigenvalue problem, and is typically badly
conditioned. From (11), the solution $\alpha^*$ must satisfy $\alpha^\top P \alpha = s^2$. As was
shown in [7], for $\zeta < \delta$ where $\delta$ is the smallest eigenvalue of $C$, (10) and (11) have
a unique solution if and only if the characteristic polynomial (or secular equation)
$f(\zeta) = \alpha^\top P \alpha - s^2 = 0$ is satisfied, provided that $f(\zeta)$ is strictly increasing for
$\zeta \in (-\infty, \delta)$. As a result, to obtain $\alpha^*$ we need to find the unique root of $f(\zeta)$ to
the left of $\delta - |u^\top b| / s$, where $u$ is the eigenvector with associated eigenvalue $\delta$.
Here we use Brent's method [13], allowing $\zeta^*$ to be calculated to high precision.
Note that the uncentered $\mathrm{VAR}[f]$ allows us to make a straightforward change
of variables $\mathbf{f} = K\alpha$ and solve the problem in $\mathbf{f}$. The resulting eigenvalue
problem is of the form (12)-(13), with an isotropic constraint, and can be solved
more efficiently via (to use their terminology) the simpler explicit characteristic
polynomial of [7], as opposed to the implicit one used here. Transforming to an
isotropically constrained variable in this way would be possible for the centered
formulation as well, however this transformation leads to numerical problems
due to the required matrix inverse. Moreover, the inverse in (14) means that it
is not possible to reduce the computational time complexity from cubic.
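
The root-finding procedure described above can be sketched end to end as follows. Several choices here are our own assumptions and are not taken from the text: the function name `ls_kpca`, the use of plain bisection in place of Brent's method, and the way the bracket is obtained from the largest eigenvalue of $L^{-1} P L^{-\top}$, where $C = L L^\top$ is a Cholesky factorisation (which requires C to be positive definite, for example a kernel matrix with a small ridge added). Each evaluation of the implicit secular function solves one linear system with $C - \zeta P$.

```python
import numpy as np

def ls_kpca(K, labeled, t, c, s, n_iter=200):
    """Sketch of (10)-(11): returns alpha = (C - zeta P)^{-1} b and zeta."""
    m = K.shape[0]
    E = np.full((m, m), 1.0 / m)
    P = K.T @ K - K.T @ E @ K                 # constraint matrix of (11)
    Kc = K[labeled, :]                        # rows of K for labeled points
    C = K + c * Kc.T @ Kc                     # quadratic part of (10)
    b = c * Kc.T @ t                          # linear part of (10)

    # Assumption: C - zeta P is positive definite exactly for zeta < 1/mu_max,
    # with mu_max the largest eigenvalue of L^{-1} P L^{-T}, C = L L^T.
    L = np.linalg.cholesky(C)
    M = np.linalg.solve(L, np.linalg.solve(L, P).T).T
    mu, V = np.linalg.eigh((M + M.T) / 2.0)
    mu_max = mu[-1]
    beta_top = V[:, -1] @ np.linalg.solve(L, b)   # assumed non-zero (generic)

    def alpha_of(zeta):
        return np.linalg.solve(C - zeta * P, b)

    def secular(zeta):                        # implicit f(zeta) = a^T P a - s^2
        a = alpha_of(zeta)
        return a @ P @ a - s ** 2

    hi = (1.0 - abs(beta_top) * np.sqrt(mu_max) / s) / mu_max   # secular(hi) >= 0
    lo = min(hi, 0.0) - 1.0
    while secular(lo) > 0.0:                  # widen until the root is bracketed
        lo = hi - 2.0 * (hi - lo)
    for _ in range(n_iter):                   # bisection (stand-in for Brent)
        mid = 0.5 * (lo + hi)
        if secular(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    zeta = 0.5 * (lo + hi)
    return alpha_of(zeta), zeta
```

For LR-KPCA the same routine could in principle be reused in the inner loop by replacing C and b with their reweighted versions; that variant is not shown here.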



5 Risk Bound 

It is straightforward to obtain a risk bound for LS-KPCA using [14], with the
analysis being similar to that of the SGT in that work. Assume the uncentered
version of (9), so that $P = K^\top K$. From (10) and (11) we have that the soft
decision function values are $z^* = K\alpha^*$, an Unlabeled-Labeled Decomposition (ULD).
Letting $m = l + n$, where $l$ (resp. $n$) is the number of labeled (unlabeled) points,
we can obtain an error bound for LS-KPCA.

Theorem 1 ([14]).

Let $z^* = K\alpha^*$ be the ULD of a transductive algorithm. Let $\|\alpha\|_2 \le \mu_1$,
$c = \sqrt{32 \ln(4e)/3} < 5.05$ and $r = 1/l + 1/n$. With probability of at least $1 - \delta$
over the choice of the training set $\mathcal{C}$ from $X$, for all $z$ in the set of all possible
hypotheses that can be generated by the algorithm for a given $X$, all possible
partitions and all possible labelings of the training set, the risk $\mathcal{R}_n(z)$ is bounded
from above by



To apply this to LS-KPCA, we replace $z^* = K\alpha^*$ in (10) and (11),
eigendecompose $G = K^{-1} C K^{-1} = Q D Q^\top$ (so that $Q^\top Q = I$) and put $a = Q^\top z$.
We see $z^* = Q a^*$ with $a^*$ as in (16). Since $Q$ depends only on the data $X$, the
latter is a ULD of LS-KPCA. From (12) $a^\top a = s^2$ and $\|Q\|_{\mathrm{Fro}}^2 = q$, where $q$ is
the rank of $G$. Hence $\mathcal{R}_n(z)$ is bounded from above by



The second term is an upper bound on the Rademacher complexity of ULD
algorithms; for the SGT it is $\sqrt{2qr}$ [14], where $q$ is the number of non-zero
eigenvalues of the Laplacian. Since both algorithms use a squared loss for $\mathcal{R}_l$,
their bounds differ only by the Rademacher complexity, which may even be made
equal by choosing $s$ appropriately. This is to be expected, since as we explain in
the next section LS-KPCA differs from the SGT only by its regulariser.

6 Relationship to Other Methods 

Firstly, MV-KPCA is similar to the Kernel Fisher Discriminant, penalising a
different set of variances but also leading to an eigenvalue problem [10]. Next,
LS-KPCA is related to [8] and [15], the former relationship being the clearest.
In particular, following [16] we can interpret Joachims' SGT [8] as a special case
of LS-KPCA. To do this, in (8) and (9) we define the RKHS of functions as the
set of real valued functions defined on the vertices of the graph and satisfying a
particular linear constraint (normalised cut balancing constraint, related to our
centered variance for (9)) so that $\mathcal{H} = \{f \in \mathbb{R}^m : f^\top e = 0\}$ where $e$ is a vector
of ones. We define the graph Laplacian matrix $L$ as in [8], and let the kernel
matrix $K$ be given by $L^+$, the pseudo-inverse of $L$. If we further restrict to the
simpler uncentered version of (9) and use the fact that the first eigenvector
of $L$ is a scaled $e$, then simplifications lead to equations (19) and (20) in [8].
Hence the SGT is LS-KPCA with an RKHS defined by a graph based regulariser.
Such regularisers have proven highly effective in SSL. A similar combination was
proposed in [17] but with a non-convex gradient descent and more sophisticated
loss functions.

The RKHS derivation of SGT has various advantages. First, our experiments
show that in some cases it can be more effective to use a normal Gaussian
kernel RKHS regulariser rather than a graph based one. As we see in Section 7,
this happens particularly when the data density is adversarial in the sense of
defying the so-called manifold assumption (that the data lie near a low
dimensional sub-manifold of the input space [17]), in which case the graph based
regulariser may be inappropriate. It is also straightforward to smoothly transition
between the two regularisers as in [18]. This transitioning can equivalently be
obtained by simply including the graph Laplacian regulariser as an additional
term in (16) of the form $\sum_{i,j} w_{ij} (f(x_i) - f(x_j))^2$, since the overall problem can
be converted just as easily as before to an optimisation problem in $\alpha$ via the
Lagrangian/representer theorem. Various other interesting options are
straightforward due to this flexibility. For example, problem specific invariances may be
incorporated as in [19] by penalising loss terms which are functions of the
gradient of $f$. This simply leads to an expansion for the optimal $f^*$ which contains
gradients of the kernel function [6], and this leads immediately to finite
dimensional optimisations similar to those in $\alpha$ in our formulations. Finally, compared
with the graph cut derivation of the SGT our RKHS derivation permits a natural
out of sample extension.

To complete our comparison with other methods let us finally mention LR- 
KPCA. This algorithm is a greater departure from previous work. It is related 
to logistic regression and of course LS-KPCA, but seems to be the first iterative 
algorithm to take advantage of the exact solution of LS-KPCA provided by [7] 
as part of an inner loop. 

7 Experiments and Discussion 

We tested on the six binary benchmark data sets of [2] as follows.

7.1 Gaussian Kernel

Each error in Table 1 corresponds to a mean (standard deviation) over the twelve
test splits supplied with the data sets, for each of the two supplied cases: 10 (top
half of table) and 100 (bottom half) labeled points. We used the Gaussian kernel
$k(x, y) = \exp(-\gamma \|x - y\|^2)$, so for each split we had to choose the parameters
$c$ (for LS- and LR-KPCA), $s$ (for LS-, LR- and MV-KPCA) and $\gamma$ (for all three
and also KPCA). To choose these parameters for each split, we performed 10




Labeled  Method        g241c          g241d          Digit1         USPS            BCI            Text
10       LR-KPCA       15.09 (4.57)   49.71 (3.96)   15.52 (4.01)   21.41 (3.06)    47.46 (1.75)   32.5 (3.57)
10       LS-KPCA       14.70 (1.83)   48.74 (3.91)   13.86 (2.99)   23.75 (3.20)    48.44 (2.57)   32.71 (3.00)
10       MV-KPCA       14.12 (2.28)   50.24 (4.03)   12.32 (3.69)   62.02 (12.70)   50.04 (0.89)   32.^7 (3.35)
10       MV-KPCA-10    36.34 (7.54)   32.96 (8.56)   21.37 (4.69)   55.59 (3.92)    48.46 (2.15)   34.74 (7.88)
10       KPCA-10       28.75 (4.26)   32.98 (8.58)   21.36 (4.67)   35.27 (13.59)   48.16 (2.91)   36.00 (9.76)

100      LR-KPCA       12.64 (0.46)   23.8 (3.00)    4.60 (1.07)    11.35 (1.87)    32.25 (2.04)   27.66 (2.88)
100      LS-KPCA       13.12 (0.45)   22.93 (3.19)   4.08 (1.34)    7.51 (1.01)     29.03 (2.20)   24.96 (2.61)
100      MV-KPCA       12.82 (0.42)   49.95 (1.64)   3.89 (1.08)    20.11 (5.61)    49.33 (1.59)   30.80 (2.03)
100      MV-KPCA-10    17.78 (1.63)   17.04 (2.70)   9.40 (1.59)    10.08 (1.82)    37.17 (3.76)   27.50 (1.57)
100      KPCA-10       18.11 (1.85)   21.14 (3.06)   7.42 (1.41)    14.46 (1.53)    48.50 (1.71)   33.79 (3.59)

Table 1. Mean (std-dev) errors over 12 splits for 10 (top half) and 100 (bottom) labeled
points out of 1500, on the benchmark sets of [2].



fold cross validation over the labeled points of that split. This model selection
procedure can be problematic, especially for the 10 labels case in which the cross
validation estimate is especially unreliable, but to be fair we always followed
this procedure. Italics in the table indicate significantly best results amongst our
algorithms, in the sense of having a mean error less than the mean error plus
standard deviation of the method with lowest mean. Bold indicates best mean
result over all published results in the study [2].

The algorithms listed in Table 1 are the following. LS- and MV-KPCA
correspond to using the first eigenfunction from those problems as the classifying
function. For MV-KPCA-10 and KPCA-10 however we took the top ten eigen- 
functions, and used the values of these functions to train a hard margin, linear 
SVM (which sees these ten values on the labeled points only) . Our motivation for 
testing MV-KPCA-10 was that by penalising within class variances rather than 
a signed class label loss term, this algorithm may be flexible enough to extract 
multiple relevant features. We include KPCA-10 as a baseline for comparison. 
The SVM was also applied to the first eigenfunction of LR-, LS- and MV-KPCA, 
although there the optimisation is trivial and merely sets a threshold. 

Directly comparable error rates for eleven other algorithms are included in
[2], and the chapter with these numbers is freely available online. It turns out
that we obtain best mean performance compared with all methods on g241c (for
10 and 100 labels) and for BCI (for 100 labels only), and generally competitive
performance overall. Comparing the different methods it turns out that one of
the strongest competitors is the SGT [8], which appears overall to be
significantly better than our methods on these data sets. However, as explained in
Section 6, the SGT is in fact a special case of our LS-KPCA. The main difference
lies in the graph based regularisation of the SGT rather than our plain
Gaussian kernel RKHS norm regularisation. Other more subtle differences include
Joachims' choice of graph connectivity and weights, use of a non-trivial
spectral renormalisation of the graph Laplacian, and rebalancing based on relative
class frequencies [8]. Since these options are also possible for the more general
LS-KPCA, we can argue that LS-KPCA should be attributed with the best
performance of the published SGT results and our results here. Most importantly,
it is clear that the most meaningful features of our results are hence captured
in the relative performances of our different methods. However, one point we
can make with regard to absolute performance measures is that our best
performance on g241c (a synthetic data set composed of samples from two highly
overlapping Gaussians, one per class) seems to indicate that LS- and MV-KPCA
are able to handle non manifold like data effectively, presumably due to the non
graph based regularisation.

Our algorithms do utilise the unlabeled examples, as we significantly
outperform the purely supervised baseline methods [2]. This is further evidenced
by the fact that KPCA-10 is never significantly better than the other variants,
and often significantly worse. The mediocre performance of KPCA-10 on these
datasets is surprising, as this algorithm seems reasonable for SSL, and was
previously proposed for exactly that [20]. We also found that LR-KPCA is rather
similar to LS-KPCA. This does not seem to be due to computational problems
since the iterative reweighting scheme always converged to high precision within
20 iterations. It may be due to the coarse grid we were forced to use for the cross
validation parameter search of LR-KPCA, due to its being rather expensive to
solve. It is expensive since each re-weighting step requires the LS-KPCA type
solution, which itself requires of the order of 10 matrix inverses of size $m$ during
the zero finding phase described in Section 4. Although expensive, a more refined
search would be possible with sufficient computational resources, but would
presumably only lead to modest improvement. Rather, it seems that the squared
loss of LS-KPCA is reasonable for these problems, which agrees with the fact
that a significant amount of work has been done in precisely the opposite direction
to our LS-KPCA → LR-KPCA. By this we mean the least squares SVM [21]
where the hinge loss is actually replaced by a squared loss for classification
(although the use of a squared classification loss is relatively uncommon overall in
the literature). Moreover, in SSL the labeled loss term plays a diminished role
in comparison to normal supervised learning as in the LS-SVM. Nonetheless,
LR-KPCA performs strongly and intuitively on the two moons toy dataset as
depicted in Figure 3.

7.2 Combined Graph Diffusion and Gaussian Kernel 

The main difference between our LS-KPCA and the SGT [8] is the extra degree of
freedom afforded by the fact that LS-KPCA may utilise an arbitrary kernel function,
rather than being restricted to a graph based representation. To demonstrate the
value of this degree of freedom we now experiment with LS-KPCA using a convex
combination of a graph diffusion and a normal Gaussian kernel, namely

$$K = w K_\gamma + (1 - w) \exp(-\tau L),$$

where $K_\gamma$ is the kernel matrix associated with the Gaussian kernel as in the
previous sub-section. $L$ is the normalised graph Laplacian defined by
$L = \mathrm{diag}(Se) - S$, where $S = \mathrm{diag}(We)^{-1/2}\, W\, \mathrm{diag}(We)^{-1/2}$. Here $W$ is the edge weight
matrix for the graph. To construct $W$ we employed a standard nearest neighbour
connectivity, and assigned Gaussian edge weights with respect to the pairwise




Labeled  Method   Digit1         USPS            BCI            g241c           COIL            g241d          Text
10       G-LS     18.15 (7.79)   25.37 (13.47)   48.93 (2.33)   42.29 (7.91)    64.42 (5.21)    46.81 (4.00)   41.85 (7.08)
10       M-LS     15.95 (7.11)   25.70 (11.38)   48.93 (1.74)   36.20 (13.70)   73.58 (10.87)   47.66 (3.80)   39.92 (7.60)

100      G-LS     2.70 (0.95)    5.75 (1.28)     48.06 (2.64)   29.36 (6.19)    25.33 (10.96)   32.50 (6.06)   26.67 (2.35)
100      M-LS     3.88 (3.00)    5.70 (2.43)     37.33 (7.93)   16.63 (3.54)    31.98 (23.72)   24.02 (4.34)   25.71 (2.04)

Table 2. Mean and standard deviation percentage errors for 10 (top half of table) and
100 (bottom half) labeled points out of 1500. G-LS is LS-KPCA with a graph diffusion
kernel, while M-LS is LS-KPCA with a combined Gaussian and graph diffusion kernel.



Euclidean distance of connected vertices. We set the bandwidth of this edge weight
Gaussian to be equal to the mean squared pairwise distance between connected
points, and removed self connections so that $W$ has zero entries on the main
diagonal. Starting with $w = 0$ (pure graph kernel), we chose the number of
nearest neighbours, $\tau$, and the parameters $c$ and $s$ of (8)-(9) using a leave one
out procedure which we accelerated by exploiting Cholesky up- and down-dates.
This pure graph kernel based method is listed as G-LS in Table 2. Given those
optimal parameters, we then fixed $\tau$ and the number of nearest neighbours, and
then selected $w$, $\gamma$, $c$ and $s$ again using leave one out, in order to assess the
relative benefit of using a non graph based kernel in this experimental setting. The
results for the case of mixing the Gaussian and diffusion kernels are listed in Table
2 as M-LS. We see that incorporating the Gaussian kernel in this manner never
degrades the performance of the pure graph based algorithm, while for data sets
g241c and BCI it significantly improves the performance.
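
The kernel construction of this sub-section can be sketched as follows. Details the text leaves open are filled with labeled assumptions: the neighbour graph is symmetrised by taking the union of each point's neighbour lists, the edge weight is taken as exp(-d²/bandwidth) without a factor of two, and the matrix exponential is computed from a symmetric eigendecomposition. All function names are ours.

```python
import numpy as np

def knn_gaussian_weights(X, n_neighbours):
    """Edge weight matrix W with Gaussian weights on a nearest neighbour graph."""
    m = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    d2_sort = d2.copy()
    np.fill_diagonal(d2_sort, -1.0)            # make each point its own first hit
    order = np.argsort(d2_sort, axis=1)
    A = np.zeros((m, m), dtype=bool)
    rows = np.repeat(np.arange(m), n_neighbours)
    A[rows, order[:, 1:n_neighbours + 1].ravel()] = True
    A = A | A.T                                # assumption: union symmetrisation
    np.fill_diagonal(A, False)                 # no self connections
    bandwidth = d2[A].mean()                   # mean squared distance over edges
    return np.where(A, np.exp(-d2 / bandwidth), 0.0)

def normalised_laplacian(W):
    """L = diag(S e) - S with S = diag(W e)^{-1/2} W diag(W e)^{-1/2}."""
    inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    return np.diag(S.sum(axis=1)) - S

def diffusion_kernel(L, tau):
    """exp(-tau L) for symmetric L via its eigendecomposition."""
    lam, V = np.linalg.eigh(L)
    return (V * np.exp(-tau * lam)) @ V.T

def combined_kernel(K_gauss, L, w, tau):
    """Convex combination w K_gamma + (1 - w) exp(-tau L)."""
    return w * K_gauss + (1.0 - w) * diffusion_kernel(L, tau)
```

Setting `w = 0` gives the pure graph diffusion kernel used for G-LS, while intermediate values of `w` give mixed kernels of the kind used for M-LS.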

8 Conclusions 

We have proposed three variants of KPCA for semi-supervised learning. All 
three are able to benefit from the unlabeled data, and lead to competitive overall 
results on benchmark sets. LS-KPCA generalises the powerful SGT algorithm 
[8], thereby admitting various alternative algorithms due to the flexibility of 
the RKHS setting. Moreover, our RKHS based derivation of the more general 
case is conceptually cleaner than that of the SGT, which was originally derived 
from a relaxed spectral graph cut perspective. Both LS-KPCA and the SGT 
utilise [7] to obtain the globally optimal solution to their non-convex optimisation
problems. We interpret this as a useful tool for problems related to that of the
T-SVM, by considering the variance term in LS-KPCA as analogous to the
T-SVM unlabeled loss function. We also proposed the more sophisticated
LR-KPCA, which implements a classification oriented sigmoid loss function via a
reweighting scheme. This reweighting scheme also utilises the globally optimal 
solution of LS-KPCA in an inner loop. 

Generally speaking, we believe that the formulations in [7] are powerful, and
perhaps under-utilised in machine learning. We hope to uncover a family of
interesting algorithms (particularly for semi-supervised learning) by studying
re-weighted versions of these formulations. Here we presented an iterative
re-weighting of the loss in (16) (as in LR-KPCA). Also interesting is the possibility
of reweighting the summand in (17) in order to obtain more sophisticated
unlabeled loss terms. This is more complex, and we plan to investigate this direction
in the future.